Abstract

Background: Ligand-based in silico target fishing can be used to identify the potential interacting targets of
bioactive ligands, which is useful for understanding the polypharmacology and safety profiles of existing drugs. The
underlying principle of the approach is that known bioactive ligands can be used as references to predict the
targets of a new compound.

Results: We tested a pipeline enabling large-scale target fishing and drug repositioning, based on simple fingerprint 
similarity rankings with data fusion. A large library containing 533 drug relevant targets with 179,807 active ligands 
was compiled, where each target was defined by its ligand set. For a given query molecule, its target profile is 
generated by similarity searching against the ligand sets assigned to each target, for which individual searches 
utilizing multiple reference structures are then fused into a single ranking list representing the potential target 
interaction profile of the query compound. The proposed approach was validated by 10-fold cross validation and 
two external tests using data from DrugBank and the Therapeutic Target Database (TTD). The use of the approach was
further demonstrated with examples concerning drug repositioning and the prediction of drug side effects. The
promising results suggest that the proposed method is useful not only for finding new uses for promiscuous drugs,
but also for predicting some important toxic liabilities.

Conclusions: With the rapidly increasing volume and diversity of data concerning drug-related targets and their
ligands, this simple ligand-based target fishing approach could play an important role in assisting future drug
design and discovery.
Background

For many decades, drug discovery and development
have been directed by the idea of 'one drug-one target-one
disease'. The paradigm is shifting, since many drugs
elicit their therapeutic activities by modulating multiple
targets, as indicated by polypharmacology [1-3].
However, multi-target interactions are either unknown
or insufficiently understood in most cases, which has
inspired many efforts to predict and characterize
drug-target associations.
The use of in silico tools to predict the targets of small
molecules has drawn increasing attention in recent
years. The predicted drug targets can be divided into
two types: I) unexploited novel drug targets that can be
used alone or with other drugs in combination
chemotherapy [3]; II) existing drug targets that provide
new uses and indications for existing drugs [4]. One
of the most prominent examples of drug repositioning
is Sildenafil, which was initially developed for
hypertension and angina and was then repositioned for the
treatment of male erectile dysfunction [5]. Other notable
drug repositioning examples include Memantine [6],
Buprenorphine [7], Requip [8,9], and Colesevelam [10].
Numerous computational strategies for target
fishing have been published. These studies enable
researchers to deepen their understanding of the bioactive
space of new chemical entities, which provides an efficient
way of designing ligands with favorable pharmacological
and safety profiles. Generally, available target fishing
approaches fall into the following two major categories:

1. Target-based Methods

Target-based methods use information about the target
proteins and include molecular docking, similarity
comparison of protein sequences or binding pockets, and so
on. For example, INVDOCK [11] and TarFisDock [12]
screen a query small molecule against a panel of prede- 
fined target protein structures whereby putative targets 
are sorted by docking score [13]. This approach has been 
demonstrated to be useful in target identification, and 
some of the predicted results have been verified by bio- 
assay and crystallographic studies [14,15]. Although sig- 
nificant improvements have been made in this area, 
there are still practical limitations for target structure- 
based approaches, such as unavailable crystal structures 
(especially for most trans-membrane proteins), high false 
positive rate, the choice of an appropriate scoring func- 
tion and high requirement of computational resources 
[16]. To circumvent these issues, several target-based 
methods relying on the analysis of existing drug-target 
interaction data have been developed. For instance, Luo 
et al. developed a web server DRAR-CPI to identify drug 
repositioning and adverse drug reactions by mining 
chemical-protein interactome [17]. Milletti et al. [18] 
and Wang et al [19] predicted polypharmacology by 
comparing the structural similarity of binding sites. 
Recently, Jacob et al [20] and Wang et al [21] con- 
structed chemogenomics approaches for qualitatively 
predicting ligand-protein interaction that only require 
the primary sequence of proteins and the structural fea- 
tures of small molecules. These approaches transform 
the target fishing problem to a machine learning prob- 
lem in the ligand-target space. Though potentially use- 
ful, they are sensitive to how a given target protein or 
ligand-protein pair is represented by descriptor vectors, 
and have a limited application domain defined by their 
training set range. 

2. Ligand-based Methods 

Ligand-based methods simplify the problem to a similarity
searching problem and use only ligand information to
predict targets. Compared with structure-based
approaches, ligand-based approaches do not rely on
complete knowledge of ligand-target interaction
mechanisms and require relatively low computational cost.
Based on how a given ligand is represented, these methods
can be divided into 2D fingerprint-, molecular shape-,
pharmacophore-, and bioactivity spectrum-based methods,
among others.



Chemically similar drugs often bind biologically rele- 
vant protein targets. To uncover the pharmacological re- 
lationships among proteins, Keiser et al developed a 
statistics-based chemoinformatics approach called simi- 
larity ensemble approach (SEA) [22], in which each tar- 
get was represented solely by the structures of its set of 
known ligands. SEA has been applied to quantitatively
identify pharmacological links between targets from the
similarity of the ligands that bind to them, expressed as
expectation values (E-values). It was further successfully
applied to a large-scale test for drug repurposing [23].
Furthermore, three dimensional (3D) molecular shape 
descriptors have turned out to be especially successful in 
describing and comparing molecular profiles. Abdul 
Hameed et al developed a novel approach by comparing 
shape similarity using program ROCS [24]. In their ap- 
proach, target profiles were generated for a given query 
molecule by computing the maximal 3D-shape and 
chemistry-based similarity to the collection of drugs 
assigned to each protein target [25]. Pharmacophores, like
molecular docking, can also be used in reverse for in
silico drug target identification. Recently, Liu et al
reported a free web interface, PharmMapper, that uses
pharmacophores to predict protein targets for small
molecules [26]. This approach automatically performs
reverse mapping against the deposited pharmacophore
models and outputs the top ranked hits. With the rapid
growth in bioactivity data of small molecules and their 
targets, it is possible to employ the information to infer 
targets for drugs or bioactive compounds. Cheng et al 
developed an approach named bioactivity profile similar- 
ity search (BASS), for associating targets to small mole- 
cules by comparing the bioactivity profiles that are 
derived from the NCI-60 cell lines [27]. 

A notable strategy for similarity searching is data fusion
(DF), which utilizes multiple reference structures to search
against a database. A DF process combines the information
provided by multiple independent sensors in order
to make judgments on an event, which was first
proposed by Peter Willett and his coworkers [28]. Afterwards,
Whittle et al [29] and Hert et al [30] used 2D fingerprint
similarity ranking with DF for virtual screening and
demonstrated its effectiveness over conventional similarity
searching in scaffold-hopping searches for structurally
diverse sets of active molecules [31]. Owing to its high search
quality and low computational cost, this approach is
especially well suited to the exponential growth in biological data.

Although many advances have been made over the last
decades, drug target prediction is still a very challenging
task, as reflected by the low success rate of clinical target
validation. The reasons are manifold, yet what poses
the greatest difficulty might be the number of protein
targets and known active small molecules. For example,
the current version of the ChEMBL database (version 17) [32]
contains 12,077,491 bioassay data points for 9,356 targets and
1,324,941 compounds. Such a data collection is so large
and complex that it becomes difficult to process using
traditional molecular modeling processes and target-ligand
interaction applications. In this regard, we may consider
target fishing as a 'big data' problem. As defined by
Douglas Laney [33], big data problems mainly have three
aspects of data growth, i.e., increasing
volume (amount of data), velocity (speed of data in and
out), and variety (range of data types and sources). For
target fishing, vast amounts of data in various measurements
(Ki, Kd, IC50, inhibition rates and so on) are being
generated daily from different sources, with the fast
development of many high-throughput bioassay systems.
Given data with these features, we first need to make a
practical trade-off between the amount of data employed
and the complexity of the models. Though under debate, it
has been widely recognized that using more data is more
beneficial, because it provides contextual richness in the
data and does not rely on unproven assumptions and
weak correlations. From this perspective, we may argue that
more emphasis should be placed on the data set used
for target fishing than on developing more sophisticated
algorithms. In this study, we try to address
the target fishing problem from the 'big data' perspective.
A large reference ligand library is first established,
with each ligand set representing a single target. Here,
the DF strategy is adopted to calculate the highest K
similarity scores (or their average value) between the
query and the ligand sets in the reference library, using
Tanimoto coefficients (Tc) of ECFP4 fingerprints. The
value of K can be 1, 3, or 5 (denoted as Max, 3NN, and
5NN, respectively), and the average fusion similarity is
a centroid score, which is described in the Methods section.
The target profile of a query chemical is then
provided according to the ranked fusion scores. The
performance of this scheme is tested on two test sets,
and a further validation is made to identify new (off-)
targets and hERG-related toxicity. The aim of the
study is to benchmark the target fishing capability of
a simple ligand-based similarity searching approach
while employing as much of the available
data as possible.
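
The fusion scheme described above can be illustrated with a minimal sketch. It assumes that each fingerprint is given as a set of on-bit indices (a stand-in for ECFP4, so no cheminformatics toolkit is needed) and that the fused score of a target is the mean of the K highest Tc values; the function names and the toy library are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def knn_fusion_score(query: Set[int], ligands: List[Set[int]], k: int) -> float:
    """Mean of the k highest Tc values between the query and one ligand set.

    k = 1 corresponds to the Max scheme, k = 3 to 3NN and k = 5 to 5NN.
    Averaging the top-k values is an assumption of this sketch.
    """
    sims = sorted((tanimoto(query, lig) for lig in ligands), reverse=True)
    top = sims[:k]
    return sum(top) / len(top) if top else 0.0


def rank_targets(
    query: Set[int], library: Dict[str, List[Set[int]]], k: int = 3
) -> List[Tuple[str, float]]:
    """Rank every target in the library by its fused similarity to the query."""
    scores = [(t, knn_fusion_score(query, ligs, k)) for t, ligs in library.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy library: two hypothetical targets, each defined by its ligand set.
    toy_library = {"A": [{1, 2, 3, 4}, {1, 2, 3}, {9}], "B": [{1, 2}, {7, 8}]}
    print(rank_targets({1, 2, 3, 4}, toy_library, k=3))
```

The ranked list returned by `rank_targets` plays the role of the target profile of the query.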

The SEA approach represents a notable recent advance
in identifying protein targets. Here, a locally implemented
SEA approach was run in parallel with our approach for
accuracy assessment, with the E-value used to
rank potential targets [22]. We perform this comparison
because both SEA and our approach use the active ligand set
to represent a target and use 2D fingerprint-based similarity
to obtain the score of a target (the SEA approach can be
considered a data fusion scheme in which the score of a
target is normalized by the size of its ligand set). It should
be pointed out that SEA requires the product of the
ligand set sizes to be not less than 100 to guarantee a
statistically reliable result [34]. This means that the current SEA is
not appropriate for fishing targets without sufficient
reference ligands. Nevertheless, its result can serve as a control
to see how existing approaches perform on the current
data set.

Results and discussion 

Drug-related targets (DRTs) and their ligands are favorable
sources for analyzing target-ligand interactions and
understanding the polypharmacological effects of drugs. As
described in the Methods section, one reference library
containing DRTs with their active ligand sets and two validation
sets were compiled for this study. Table 1 summarizes
the number of compounds and targets in each set. We
further analyzed the polypharmacological profile of the
ligands in the reference library. As shown in Figure 1,
most active ligands have only a single target, while
a significant number of ligands have two
or three targets. The number of ligands having many targets
is small: only 1,512 ligands have more than five targets.

Given a ligand that has m experimentally verified targets
and a target fishing scheme that yields n predicted targets
for the ligand (i.e., the top ranked n targets), we used
the following evaluation metrics to measure the performance
of the scheme: Precision (PR_n), Recall (RE_n),
F-measure (F_n) [35], and the uninterpolated precision
(PR') [36]. PR' is given by the average of the precision values
PR_i over ranking places 1 to m. Here, m for a query
ligand is the number of its interacting targets. The detailed
definitions of these terms are provided in the
Methods section.
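
As an illustration of these metrics, the following minimal sketch computes PR_n, RE_n, F_n and PR' for a single query from a ranked target list. It assumes that F_n is the usual harmonic mean of PR_n and RE_n, since the exact definitions are given in the Methods section rather than here; the function names are hypothetical.

```python
from typing import List, Set


def precision_at(ranked: List[str], true_targets: Set[str], n: int) -> float:
    """PR_n: fraction of the top-n predictions that are verified targets."""
    hits = sum(1 for t in ranked[:n] if t in true_targets)
    return hits / n


def recall_at(ranked: List[str], true_targets: Set[str], n: int) -> float:
    """RE_n: fraction of the m verified targets found in the top-n predictions."""
    hits = sum(1 for t in ranked[:n] if t in true_targets)
    return hits / len(true_targets)


def f_measure_at(ranked: List[str], true_targets: Set[str], n: int) -> float:
    """F_n: harmonic mean of PR_n and RE_n (assumed standard definition)."""
    p = precision_at(ranked, true_targets, n)
    r = recall_at(ranked, true_targets, n)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


def uninterpolated_precision(ranked: List[str], true_targets: Set[str]) -> float:
    """PR': average of PR_i over ranking places i = 1..m, m = number of true targets."""
    m = len(true_targets)
    return sum(precision_at(ranked, true_targets, i) for i in range(1, m + 1)) / m


if __name__ == "__main__":
    ranking = ["A", "B", "C", "D"]   # predicted targets, best first
    verified = {"A", "C"}            # m = 2 verified targets
    print(uninterpolated_precision(ranking, verified))  # (1/1 + 1/2) / 2
```

With this definition, PR' equals 1 only when all m verified targets occupy the first m ranks.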

1. Ten-fold cross-validation

The 10-fold cross validation was performed to determine
the number K of nearest neighbors to fuse and to
evaluate the effectiveness of the fusion strategy when
only part of the set was used as reference. In the
validation, the overall reference set was randomly split
into ten parts. For the ligands of each part, their targets
were predicted using the ligand and target information
of the remaining nine parts. The performance achieved for
each part was recorded, and the average PR' value was
used to evaluate the four fusion schemes as well as SEA

Table 1 Statistics of the data sets used as reference ligand library and for external validation sets

Data                 Sources    Target  Actives (or drugs)  Pairs    Ref.
Reference library    BindingDB  533     179,807             246,053  [40]
External validation  DrugBank   455     711                 7,917    [41]
                     TTD        255     476                 1,084



Liu et al. Journal of Cheminformatics 2014, 6:33 
http://www.jcheminf.com/content/6/1/33






Figure 1 Plot to show the distribution of the number of active ligands against the number of targets per ligand (x-axis: Number of targets per active).



for comparison. Since SEA calculates the set-wise
similarity among ligand sets, it would not be statistically
reliable if the ligand sets comprise fewer than ten ligands.
In practice, it is suggested that the product of the set
sizes should be higher than 100 [22,34]. In the case of
target fishing, the set-wise similarity is typically
calculated between a single query ligand and a reference
ligand set. Therefore, to perform a comparison, we ran
another test for SEA using only the reference targets
that have 100 or more ligands. Altogether,
292 targets with 173,862 ligands (236,986 pairs) were
retained as a separate reference set to test SEA
performance. As outlined in Table 2, the Max scheme performs
slightly worse than the 3NN and 5NN schemes. 3NN
and 5NN show very close results, with 3NN slightly
outperforming the others. A possible explanation is that 5NN
includes some ligands that are not very similar to the
query, whereas the first three neighbors of a query in a
reference ligand set may better represent the corresponding
target and discriminate it from other candidate targets.
Moreover, it is clear that the Centroid score is less effective
than the KNN schemes. In our experiment, SEA



Table 2 The result (PR') of 10-fold cross-validation on the reference set

       Max    3NN    5NN    Centroid  SEA    SEA*
Mean   0.927  0.950  0.947  0.228     0.628  0.717
S.E.   0.002  0.002  0.002  0.002     0.001  0.001

*Only the targets containing more than 100 reference ligands were considered
in the validation.

PR' is given by the average of the precision values PR_i over ranking places 1 to
m. Here, m for a query ligand is the number of its interacting targets.



performs worse than KNN fusion (PR' of 0.628, or 0.717 on
the restricted set, versus 0.950 for 3NN in Table 2). As expected,
the KNN strategy is able to determine the targets of small
molecules with high accuracy and robustness in
internal cross validation.
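
The splitting step of the 10-fold cross validation can be sketched as follows. This is a minimal illustration under the assumption that the reference set is stored as a mapping from ligand identifiers to their sets of targets; the function name, the seed and the data layout are hypothetical.

```python
import random
from typing import Dict, Iterator, List, Set, Tuple


def ten_fold_splits(
    ligand_targets: Dict[str, Set[str]], n_folds: int = 10, seed: int = 0
) -> Iterator[Tuple[List[str], Dict[str, Set[str]]]]:
    """Yield (held-out ligand ids, reference subset) for each of the n_folds parts.

    The ligands are shuffled once and dealt into n_folds parts; each part is
    predicted in turn using only the ligands of the remaining parts.
    """
    ids = sorted(ligand_targets)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        reference = {
            lig: targets
            for lig, targets in ligand_targets.items()
            if lig not in held
        }
        yield held_out, reference


if __name__ == "__main__":
    toy = {f"L{i}": {"T1"} for i in range(25)}
    for test_ids, reference in ten_fold_splits(toy):
        print(len(test_ids), len(reference))
```

For each split, the held-out ligands would be scored against the reference subset and the resulting PR' values averaged over the ten parts.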

Since the targets here are represented by their refer- 
ence ligands, the predictive ability relies on the represen- 
tativeness and diversity of reference library. Figure 2 
displays a bar plot of the number of active ligands for all 
the targets in the reference set. Among all the 533 tar- 
gets having more than 10 active ligands, 292 of them 
have more than 100 active ligands and 72 of them have 
more than 1000 active ligands. These 533 approved drug 
targets cover the major members of clinical therapeutic 
protein receptors, enzymes and disease related targets. 
With the amount of bioassay data growing, our refer- 
ence library can be easily extended to incorporate more 
ligand-target interaction data. Next, we examined
how 3NN behaves on targets with different numbers
of reference ligands. We studied the PR' in 10-fold cross validation
by grouping the reference targets evenly into 10 bins
according to their number of ligands. The PR' of a bin is
defined as the average PR' score for all the targets in that
bin. The yellow line shown in Figure 2 depicts how the
PR' varies across targets with ascending number of
reference ligands. As the number of reference ligands
increases, PR' increases and the error bar decreases,
suggesting that 3NN tends to perform better for targets
with a large number of reference ligands. Overall, the
PR' scores range from 0.8 to 0.96, demonstrating that
3NN has excellent accuracy for fishing targets with
adequate reference ligands. At the same time, we may also
find that the approach showed a reasonable performance
Figure 2 Plot to show the distribution of average PR' against the targets with increasing number of reference ligands. PR' is given by
the average of the precision values PR_i over ranking places 1 to m. Here, m for a query ligand is the number of its interacting targets.



on targets with a small number of reference ligands. As
shown in Figure 2, the majority of targets collected in
our reference set have adequate reference ligands,
ranging from 10^2 to 10^3, which ensured the high predictive
ability of the approach.

2. The performance of 3NN with increasing size of
reference set

With the increasing number of active ligands available in
various databases, we want to revisit target prediction in
the context of 'big data'. The reference set we collected
contains more than a hundred thousand active ligands
and will certainly grow in the future. Therefore, we
investigated how the performance of
3NN changes as more reference compounds become available.
As every target is represented by a set of active ligands,
we created a test set of 2665 ligands by randomly picking
five ligands from each of the 533 targets. The
remaining ligands were used as the reference
set. To study 3NN on different sizes of reference
sets, a total of eight reference sets were made
by randomly sampling 0.1%, 1%, 5%, 10%, 20%, 50%, 80%
and 100% of the remaining reference ligands. Then,
we ran 3NN using each of the reference sets and recorded
the PR' values. The experiment was repeated five times
and the overall result is depicted in Figure 3. When
only 0.1% of the information is used, the average PR' is close to
0 with a small SD of 0.2, suggesting that most of the targets
cannot be identified with such a small reference set. The
PR' gradually increases as more reference information
is involved. When 10% of the references were used, the average
PR' is more than 0.6 but the error bar is relatively large
(SD = 0.44). This suggests that the prediction accuracy of
3NN may reach a high level if the related reference set is
of considerable volume, but the variance of prediction is
also large, which is what would occur in
most cases for similarity-based approaches. When 50% of the
reference ligands were used, the PR'
showed a notable increase to 0.89 and the SD was reduced to 0.29,
which indicates that more targets could be identified for
the test molecules. It is also worth noting that the line
flattens and the error bar decreases when more than half of
the references were used, which shows that there is only
a marginal gain in prediction performance (average PR') if
the related reference set is of a sufficiently large volume.
Finally, using the total reference set, the average PR'
and SD are 0.96 and 0.18, respectively. The test demonstrates
that more prior knowledge may not only improve
the prediction accuracy of the 3NN fusion strategy in
target prediction, but also reduce its prediction variance.
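
The construction of the test set and of the down-sampled reference sets can be sketched as below. This is only an illustration: it assumes that the library maps each target to a list of ligand identifiers, that held-out ligands are removed from every target's list, and that at least one ligand is always kept; the function names, the seeds and these choices are assumptions of the sketch, not details reported in the study.

```python
import random
from typing import Dict, List, Tuple

# Fractions of the remaining reference ligands sampled in the experiment.
FRACTIONS = [0.001, 0.01, 0.05, 0.10, 0.20, 0.50, 0.80, 1.00]


def hold_out_per_target(
    library: Dict[str, List[str]], n_test: int = 5, seed: int = 0
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Pick n_test ligands per target as queries; the rest stay as references.

    Every target is assumed to have at least n_test ligands.
    """
    rng = random.Random(seed)
    test = {t: rng.sample(ligs, n_test) for t, ligs in library.items()}
    held = {lig for picked in test.values() for lig in picked}
    remaining = {
        t: [lig for lig in ligs if lig not in held] for t, ligs in library.items()
    }
    return test, remaining


def sample_reference(
    remaining: Dict[str, List[str]], fraction: float, seed: int = 0
) -> Dict[str, List[str]]:
    """Keep a random fraction of the distinct remaining ligands (at least one)."""
    rng = random.Random(seed)
    pool = sorted({lig for ligs in remaining.values() for lig in ligs})
    k = max(1, round(fraction * len(pool)))
    kept = set(rng.sample(pool, k))
    return {t: [lig for lig in ligs if lig in kept] for t, ligs in remaining.items()}


if __name__ == "__main__":
    toy = {"T1": [f"a{i}" for i in range(20)], "T2": [f"b{i}" for i in range(30)]}
    queries, rest = hold_out_per_target(toy)
    for fraction in FRACTIONS:
        subset = sample_reference(rest, fraction)
        print(fraction, sum(len(ligs) for ligs in subset.values()))
```

Repeating the sampling with different seeds would mimic the five repetitions of the experiment.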

Given a query compound, it is interesting to investigate
how similar the ligands in the reference set are to it and
whether this similarity affects target prediction. For
the test set containing 2665 compounds, we checked the
variation of PR' versus the similarity of a query to its
closest neighbor in its corresponding reference ligand
set, as shown in Figure 4. On the one hand, the 3NN model
always gave a high PR' value if the query
can find close neighbors. This observation suggests that
a sufficiently large and diverse reference library is important
for prediction accuracy, which explains why
the target fishing problem should be addressed from a
"big data" perspective. On the other hand, we may also
notice that the 3NN model is robust, as the PR' reaches
0.65 for queries with similarity scores of 0.4 ~ 0.5. This means
that the model is still useful when the query can only
retrieve some moderately similar ligands.
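
The analysis above amounts to grouping queries by their nearest-neighbor similarity and averaging PR' within each group. A minimal sketch follows, assuming bins of width 0.1 on the Tc scale and input records of the form (nearest-neighbor Tc, PR'); the bin width, the function name and the record layout are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mean_pr_by_similarity_bin(
    records: Iterable[Tuple[float, float]], n_bins: int = 10
) -> Dict[int, float]:
    """Average PR' per similarity bin.

    Bin i covers nearest-neighbor Tc values in [i / n_bins, (i + 1) / n_bins);
    a Tc of exactly 1.0 is placed in the last bin.
    """
    bins: Dict[int, List[float]] = defaultdict(list)
    for tc, pr_prime in records:
        index = min(int(tc * n_bins), n_bins - 1)
        bins[index].append(pr_prime)
    return {index: sum(vals) / len(vals) for index, vals in sorted(bins.items())}


if __name__ == "__main__":
    # Hypothetical (nearest-neighbor Tc, PR') pairs for four queries.
    toy_records = [(0.45, 0.6), (0.42, 0.7), (0.95, 1.0), (1.0, 0.9)]
    print(mean_pr_by_similarity_bin(toy_records))
```

Bin 4 in this layout corresponds to the 0.4 ~ 0.5 similarity range discussed above.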

In general, 3NN achieves high PR' in the internal validation,
which is partially attributed to the close analogues
that exist in both the test and reference sets. To
assess the performance of the approach further on practical
cases, two external tests are performed and analyzed
in the following section.

3. Predicting targets for approved drugs from DrugBank
and TTD

Many drugs from a wide range of therapeutic areas have
more than one interacting target, and the multiple on-target
and off-target bindings are essential for their efficacy
and side effects. For example, the number of reported
interacting targets for the drugs treating central nervous
system disorders is as high as 64 in our validation set. We
compared 3NN and SEA for target identification for these
multi-target drugs. For each test compound, we considered
the top 20 predicted targets, in terms of the metrics PR_n,
RE_n, F_n and PR'. The averaged results of the 711
drugs present in DrugBank and the 476 drugs in TTD
are reported.
Figure 4 The bar plot showing the variation of PR' versus the similarity of a query to its closest neighbor in its corresponding
reference ligand set. This analysis is based on the test set containing 2665 query ligands.
Figure 5(A-D) shows the performance changes over the
top 20 predictions for the DrugBank and TTD ligands.
Clearly, the 3NN scheme performs better in the validation,
with all the metrics consistently higher
than those of SEA. As depicted, the results of the two
approaches show a similar trend except for the following
minor differences. For the 3NN scheme, a gradually decreasing
PR_n and increasing RE_n are observed, and the
changes are more significant when n is small. For SEA,
PR_n decreases more quickly, and its value is close to
zero when n is larger than 10. The RE_n of SEA behaves
differently: it increases only when
n is less than 5 and then levels off. These results
mean that SEA can only identify the "true targets" in a
few top ranked predictions, and further increasing the
number of predictions will not yield more "true targets".
In contrast, 3NN is able to retrieve more experimentally
observed targets as it allows more predictions,
demonstrating its advantage in addressing the
polypharmacology of drugs.

Figure 6(A) and 6(B) show the F_n curves of 3NN and
SEA for drugs from DrugBank and TTD, respectively.
There are a few differences between the two validation
sets. TTD mainly contains the
primary targets directly related to the therapeutic actions
of approved drugs, while DrugBank collects more
comprehensive potential targets. In addition to the targets
that confer the desired pharmacological effect,
DrugBank also contains other targets, including metabolic
enzymes, carriers and transporters. These targets
usually account for side effects or drug-drug interactions.
A comparison of these two sets shows that 64% of the
drugs in TTD are also present in DrugBank. Therefore, we
may consider TTD as a subset of DrugBank, in which
TTD only includes pharmacological targets and DrugBank
includes more comprehensive interaction targets.
Further inspection of Figure 6 reveals that 3NN
displays a different pattern on these
two sets. As shown in Figure 6(A), the F_n of 3NN
achieves its maximum value for DrugBank when n
equals 6 (the vertex of the curve). Clearly,
the F_n curves of 3NN are consistently higher
than those of SEA, suggesting its higher performance in
predicting drug targets.

From Figure 6(B), we may also notice that 3NN and
SEA show a similar tendency on TTD, where the F_n
curves rapidly decline when n > 3. This suggests that the
therapeutic targets can be well identified in the top three
predictions, and considering more targets ranked outside
the top three would result in a significant number of
false predictions. However, if one aims to predict
non-therapeutic targets as well, the prediction rank list
should be extended. As shown in Figure 6(A), the decline
of 3NN is still slow when n > 6. Another point of
note is that for both 3NN and SEA the maximum F_n
value obtained on TTD is higher than that on DrugBank. This





Figure 5 Comparison of PR_n by 3NN and SEA for the drugs from: (A) DrugBank set and (B) TTD set; Comparison of RE_n by 3NN and
SEA for the drugs from: (C) DrugBank set and (D) TTD set.
observation suggests that therapeutic targets could be
more reliably predicted. One possible reason
is that therapeutic targets usually form specific
interactions with their corresponding drugs with high
affinities. However, non-therapeutic targets, e.g.
CYP450s, may exhibit enormous promiscuity and
interact with a huge range of structurally unrelated
ligands. The weak and non-specific interactions may lead
to inferior performance in predicting drugs interacting
with these targets. More details about this test are
provided in Additional file 1: Table S1 and Table S2.

Alternative target fishing methods include 3D similarity
searching methods as well as those based on machine
learning. The 3D similarity searching methods rely on the
generation of active conformations for both references
and queries, which are difficult to obtain for some flexible
compounds and involve high computational cost. For
machine learning methods, both known active and inactive
molecules should be present to form a training set.
However, true inactive data are hardly available in
most public databases, which significantly restricts their
use in target fishing. In comparison, 2D similarity
searching methods require only positive data, and
chemical fingerprints are fast to compute, making this an
efficient method for large-scale target fishing.

4. Identification of New (Off-) target-drug interactions 

From the previous analysis, we may notice that the 3NN
DF scheme based on a large reference set is suitable for
ligands with multiple targets. It is therefore interesting
to investigate its performance in identifying
new and off-targets from experimentally verified
drug-target associations. To this end, we tested 3NN
using Keiser's data that were previously used to verify
the predictions of SEA [23]. The first test set includes
within-boundary predictions for 10 GPCR drugs and
cross-boundary predictions for 4 non-GPCR drugs, and
the second set includes 32 drugs with 39 off-target
associations.

Table 3 shows the 3NN rankings of the new targets for
the drugs. We noticed that 62 out of 65 new drug-target
associations can be "hit" within the top 20 predictions, and
most of the targets are predicted in the top 1-6. This result
is consistent with the vertex shown in Figure 6(A), indicating that
3NN could achieve good predictive ability in the top 1-6.
It also means that new (off-) targets could be successfully
identified in nearly the top 1% of the full set of 533 targets.
Only a few experimentally verified target-ligand
interactions were ranked outside the top 20. For example,
the interaction of the β2 adrenergic receptor with Prozac
and Paxil was ranked at the 39th and 75th places by
3NN, respectively. These two drugs were predicted to
interact with the β adrenergic receptor by SEA, and later
experiments revealed that they are medium-potency
blockers of the β2 subtype (i.e., Ki = 4.4 μM for Prozac and
10 μM for Paxil). Since our DRT reference set mainly focuses
on strong binders to a specific protein, the low
rankings of the target for these two drugs may be partially
attributed to their low binding affinities to the
target, which are close to the threshold for selecting
ligands in our reference set. Another important feature
of our DRT reference set is that it includes more target
members that were categorized according to their
sequence similarities. Compared with the reference set
of SEA, our DRT reference set specifies the three subtypes
of the β adrenergic receptor, hence not requiring a
separate assay for each one. Indeed, for Prozac and
Paxil, their interactions with β2 were ranked highest
among the three subtypes, which is consistent with
Keiser's assay results.

5. hERG toxicity prediction 

Off-target interactions are typically related to adverse
drug effects, a prominent example being the
interaction of numerous compounds with hERG, a
potassium ion channel expressed in the heart and in
nervous tissue. In the past decade, a frequent cause of the
withdrawal of marketed drugs has been the potentially
fatal arrhythmia that is induced by blockage of
hERG channels [37,38]. In this study, we further investigated
hERG-related off-target prediction using the
3NN target fishing scheme. Table 4 lists nine approved
Table 3 Target ranking results of the 3NN scheme on the novel drug-target association set of Keiser et al. [23]

Drug Pharmacological Action New (off-) targets 3NN 

Ranking 

New aminergic Sedalande Neuroleptic ae adrenergic blocker 3 

GPCR targets rllTin 

3 5-HT1D antagonist 1 

Dimetholizine Antihistamine; antihypertensive an adrenergic blocker 5 

5-HT1 A antagonist 1 

D2 antagonist 2 

Kalgut Cardiotonic (3a adrenergic agonist 1 

Fabahistin Antihistamine 5-HT5A antagonist 3 

Prantal Anticholinergic; antispasmodic 5ntichol agonist 1 

N,N-dimethyltryptamine Serotonergic hallucinogen 5-HT1 B agonist 1 

5-HT2A agonist 2 

5-HT5A antagonist 13 

5-HT7 modulator 1 1 

Doralese Adrenergical blocker; antihypertensive; D 4 antagonist 6 

antimigraine 

Prozac 5-HT reuptake inhibitor; antidepressant |3 adrenergic blocker 39 

Motilium Antiemetic; peristaltic stimulant ae adrenergic blocker 6 

Paxil 5-HT reuptake inhibitor; antidepressant (3 adrenergic blocker 75 

New cross-class Xenazine ae (transporter) ae adrenergic receptor (GPCR) 6 

tarqets 

Rescriptor H IV- 1 reverse transcriptase (enzyme) H4 receptor (GPCR) 4 

Vadilex NMDAR (ion channel) uMDAR (j receptor (GPCR) 16 

4 

5-H^ 14 
SERT(transporter) 

RO-25-6981 NMDAR (ion channel) 5-H^ 5 

SERT (transporter) 1 3 

D4 receptor (GPCR) 6 

noradrenaline transporter(transporter) 13 

Koradren receptor (GPCR) 20 

Other off-targets Amisulpride Antipsychotic D2 Antagonist 1 

Aripiprazole 5-HT1 A Agonist D3 Antagonist 1 

5-HT2A Antagonist D2 Antagonist 1 

Alcohol Deterrent Antiamyloidogenic Agent 
Antipsychotic Treatment of Cocaine Dependency 

Benperidol Antipsychotic 5-HT2A Antagonist 2 

D4 Antagonist 10 

Benzoclidine Antihypertensive Anxiolytic M3 Antagonist 2 

Bromperidol Antipsychotic 5-HT2A Antagonist 6 

Cabergoline Prolactin secretion inhibitor Dopamine Agonist 2 

Adrenoceptor (renoceptornist 2 

5-HT1D Agonist 1 

Captopril ACE Inhibitor Antihypertensive Cardiotonic Leukotriene A4 Hydrolase 6 

Inhibitor 

Carbacyclin Antithrombotic Prostaglandin 2 

Carvedilol Antianginal Antihypertensive Adrenergic ((3) Blocker 1 
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Table 3 Target ranking results of the 3NN scheme on the novel drug-target association set of Keiser et al. [23] 

(Continued) 



Enrofloxacin 


Antibacterial Quinolone 


DNA gyrase 


1 


Fluanisone 


Neuroleptic 


5-HT2A Antagonist 


8 


Hexoprenaline 


Bronchodilator 


Adrenergic ((3) Agonist 


1 


LinBzolid 


Antibacterial Oxazolidinone 


/V\/\0 A Inhibitor 


2 


LoratadinB 


Antihistamine Inflammatory Bowel Disease 
Rhinitis 


Farnesyl Protein Transferase 
Inhibitor 


382 


[Wlplnprnnp 


Npi irnlpntir 


S-HT7A Antanonkt 


1 


[wiptprpinlinp 

I V ICT ICTI y Ull 1 ICT 


Antiminrainp Vasodilator 

i\\ i li i i ii^jigii \\z vajuunaLui 


AHrpnnrpntnr (Ypnnrpntnr Vpk 


2 


Naftnnidil 


Prostatp DKorHprs 

1 I^OLaLtT LylO^I ^*3I O 


5-HT1A Antagonist 

n- piHrpnprnir Rlorkpr 

kj. anicriicTiyi^. uiw^-ixcri 


7 
2 


NaringBnin 


Antiulcerative Enzyme inhibitor Enzyme 
inhibitor (Histidine decarboxylase) 


Xanthine Oxidase Inhibitor 


16 


Nuvenzepine 


Antiulcerative 


M2 Antagonist 


1 


Pimozide 


Antipsychotic 


Anticholinergic, Ophthalmic 

S-HT7A Antanonkt 


3 

1 6 


Rabeprazole 


Antisecretory (gastric acid) Antiulcerative 


H + /K + -ATPase Inhibitor 


1 


Rispenzepine 


Antispasmodic 


M2 Antagonist 


1 


1 cLI dUcI IdZII It: 


Ml ixiuiy LIL 


U \ All Ldy Ul IISL 


D 


Tetraminol 


Antihypertensive Vasodilator 


Adrenoceptor (renoceptorsive 


2 


1 1 r^i r\ i 1 
UldpiUM 


Ul dUlcllciyiL DICJLKcl Al 1LII lyptrl Lcl ISIVc 


r |_|T1 A Ant^nrinict 
Jill 1 A Al 1 Ldy Ul IISL 


z 


\-.\\ IILdpi lUc liyyiuycll Idllldlc 


Al ILIUILcldLIVt: Olll 1 lUldl 1 1, rtrllSldlllL 


^ VAT A Anr^nict 
D n 1 ^+ Ay Ul IISL 


1 
I 


Lisurid6 mal6at6 


Antiparkinsonian Dopamine Autoreceptor 
Agonist Prolactin Secretion Inhibitor 


Adrenoceptor (renoceptorlact 


1 
I 


Methylphenidate 


Adrenergic Agents Adrenergic Uptake 
Inhibitors Central Nervous System Stimulants 
Dopamine Agents Dopamine Uptake Inhibitors 
Sympathomimetics 


M3 Antagonist 


3 


Pergolide mesylate 


Antiparkinsonian, Dopamine Agonist 


5-HT1D Agonist 
Adrenoceptor (renoceptorstni 


2 
1 


Propafenone hydrochloride 


Antiarrhythmic 


(3ntiarrhythm blocker 


3 


Terbinafine hydrochlorid 


Antifungal 


Squalene Epoxidase Inhibitor 


1 


Urapidil 


ar adrenergic Blocker Antihypertensive 


D2 Antagonist 


3 



Abbreviations: 5-HT Serotonin, D Dopamine receptor, HIV Human Immunodeficiency Virus reverse transcriptase, H Histamine, NMDAR N-methyl-D-aspartate receptor 
(glutamate receptor), SERT serotonin transporter, M Muscarinic acetylcholine receptor. 



drugs withdrawn from the market due to hERG toxicity 
[37]. For seven of these drugs, their interactions with 
hERG were predicted in the top 20 list. For three drugs, 
terfenadine, Sparfloxacin, Droperidol, their interactions 
with hERG were even ranked at the first place. For com- 
parison, the rankings of therapeutic targets of these 
drugs were also listed. We notice that all the on-target 
interactions of 9 drugs fall in the top 20 list. These re- 
sults highlight the usefulness of the 3NN scheme on 
identifying both on and off-targets. Particularly, the high 
ranking of hERG as a potential interacting target of a 
query compound may serve as a hERG toxicity alert for 
further safety investigation. 
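As a small illustration of how such an alert could be derived from a ranked target list, the sketch below flags a compound when hERG appears within the top 20 ranked targets. The function name `herg_alert`, the `cutoff` parameter and the dictionary layout are assumptions made for illustration only; the ranks are the hERG ranks reported in Table 4.

```python
# hERG ranks from Table 4 (3NN scheme); the dict layout is illustrative only.
HERG_RANK = {
    "Astemizole": 2, "Cisapride": 6, "Sertindole": 2, "Terfenadine": 1,
    "Sparfloxacin": 1, "Droperidol": 1, "Levomethadyl acetate": 27,
    "Lidoflazine": 21, "Terodiline": 4,
}


def herg_alert(rank: int, cutoff: int = 20) -> bool:
    """Return True when hERG is ranked within the top `cutoff` targets."""
    return rank <= cutoff


# Drugs for which the hypothetical alert would fire.
flagged = sorted(drug for drug, rank in HERG_RANK.items() if herg_alert(rank))
print(len(flagged), "of", len(HERG_RANK), "drugs flagged:", flagged)
```

With the default cutoff of 20, the alert fires for seven of the nine drugs, in line with the count given in the text. A stricter or looser cutoff changes only the `cutoff` argument.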



Table 4 Target ranking results of the 3NN scheme on nine drugs with hERG toxicity

Drug | Therapeutic target | 3NN ranking (TT) | 3NN ranking (hERG)
Astemizole | H1 receptor | 3 | 2
Cisapride | 5-hydroxytryptamine 2A/3/4 receptor | 5 | 6
Sertindole | D2 receptor | 1 | 2
 | 5-hydroxytryptamine 2A/2C/6 receptor | 4 |
 | α adrenergic receptor | 5-7 |
Terfenadine | Histamine H1 receptor | 2 | 1
 | Potassium voltage-gated channel subfamily H member 2 | 1 |
 | M3 | 5 |
Sparfloxacin | DNA topoisomerase 4 subunit A | 2 | 1
 | DNA gyrase subunit A | 3 |
 | DNA topoisomerase 2α | 6 |
Droperidol | D2 dopamine receptor | 3 | 1
 | α2 adrenergic receptor | 5 |
Levomethadyl acetate | μ opioid receptor | 1 | 27
 | Neuronal acetylcholine receptor | 3 |
Lidoflazine | Calcium channel | 1 | 21
Terodiline | Calcium channel | 47 | 4

Abbreviations: TT Therapeutic target, H Histamine, D Dopamine receptor, M Muscarinic acetylcholine receptor.

Experimental

1. Reference set preparation

While there are many public databases (ChEMBL [39],
BindingDB [40], PubChem, etc.) storing bioactive small
molecules and target information, there is no dedicated
collection of ligands of DRTs. Here, we built a collection
of the active ligands from BindingDB for FDA-approved drug
targets from DrugBank [41]. The detailed procedures for the
data set preparation are as follows:

(I) DrugBank provides a list of FDA-approved drug targets,
from which the protein sequences of all targets of
small-molecule drugs were downloaded. Sequences of protein
targets deposited in BindingDB were used to create a local
BLAST database with NCBI BLAST [42].

(II) The sequences downloaded from DrugBank were used to
perform a similarity search against the local BLAST
database, to find drug-target-related proteins in BindingDB
and to retrieve their interacting ligands. Using an E-value
threshold of 1E-50, we obtained a target mapping between
BindingDB sequences and drug target sequences from
DrugBank. A protein target of BindingDB exhibiting high
homology with any of the drug targets was considered a
potential drug target.

(III) The ligands were further filtered to eliminate those
with weak binding affinity to a specific protein. The
threshold for an "active" ligand was set as IC50, Ki, Kd or
EC50 < 10 μM, or ΔG < −28.53 kJ/mol.

(IV) The protein targets retrieved above were redundant
(i.e. there are identical proteins with different names),
and some of them are highly homologous to each other (e.g.
mutants or proteins from different source organisms). To
address this issue, we combined the proteins showing high
sequence similarities by another round of BLAST searches,
with a more stringent E-value threshold of 1E-120. All the
active ligands of a "combined" protein target were pooled
together. The resulting database contained 725 targets in
all.

(V) To ensure that every target is represented by a
sufficient number of ligands, we removed the targets with
10 or fewer active ligands. Finally, our curated database
covers 533 targets with 179,807 active ligands in total.
Approved drugs are used as an independent test set for
additional validation.

This established chemical reference library is orga- 
nized according to DRTs, and each of them is repre- 
sented by a set of corresponding active ligands. In our 
reference library, the ligand set contains unique ligands 
for each target. All the data preparation procedures are 
performed with in-house Python scripts. The reference 
library is designed to enable further updating by adding 
new target-ligand interaction data. 
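Steps (III) to (V) above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' in-house scripts: the record layout `(target_name, ligand_id, affinity_uM)`, the function name `build_reference_library` and the `target_alias` map (standing in for the result of the second BLAST round in step IV) are all hypothetical, and affinities are assumed to have been converted to μM beforehand.

```python
from collections import defaultdict

AFFINITY_CUTOFF_UM = 10.0  # step (III): IC50, Ki, Kd or EC50 below 10 uM
MIN_LIGANDS = 10           # step (V): targets with <= 10 actives are dropped


def build_reference_library(records, target_alias):
    """Build {target: set(ligand_ids)} from (target, ligand, affinity_uM) records.

    `target_alias` maps redundant or highly homologous target names to one
    combined target (step IV); names missing from the map are kept unchanged.
    """
    library = defaultdict(set)
    for target, ligand, affinity_um in records:
        if affinity_um < AFFINITY_CUTOFF_UM:  # keep only "active" ligands
            # Pool ligands of combined targets; the set keeps them unique.
            library[target_alias.get(target, target)].add(ligand)
    # Keep only targets represented by more than MIN_LIGANDS unique ligands.
    return {t: ligs for t, ligs in library.items() if len(ligs) > MIN_LIGANDS}


# Toy data: target A has 10 actives, its mutant adds one new and one duplicate
# ligand; target B has 11 ligands, of which only 6 pass the affinity cutoff.
records = [("A", f"L{i}", 1.0) for i in range(10)]
records += [("A_mutant", "L0", 0.5), ("A_mutant", "L10", 2.0)]
records += [("B", f"M{i}", 50.0 if i < 5 else 1.0) for i in range(11)]

library = build_reference_library(records, {"A_mutant": "A"})
print({target: len(ligands) for target, ligands in library.items()})
```

In the toy data, target A survives with 11 pooled unique ligands, whereas target B is removed because only 6 of its ligands are active. Storing each ligand set as a Python `set` mirrors the statement that the ligand set of each target contains unique ligands.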

2. Validation sets preparation 

The following two datasets were used to test the target
prediction performance of different approaches: approved
drugs from the Therapeutic Target Database (TTD) [43] and
approved drugs from DrugBank 3.0 [41]. These datasets
contain drugs or drug-like compounds and their protein
target sequences. For each set, the small molecules already
present in the reference library were first removed, and
the sequences were mapped onto DRTs by similarity searching
against the local BLAST database mentioned above.
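The two preparation steps described above (mapping sequences by BLAST and removing compounds already in the reference library) could look roughly as follows. The sketch assumes that the BLAST hits were saved in the standard 12-column tabular format, in which the E-value is the 11th column, and that compounds are compared through some canonical identifier; both assumptions, as well as the function names `map_targets` and `drop_known_molecules` and the use of "at or below" for the E-value cutoff, are illustrative and not taken from the original scripts.

```python
def map_targets(blast_tabular_lines, evalue_cutoff=1e-50):
    """Parse tab-separated BLAST hits and return {query_id: set(subject_ids)}.

    Only hits with an E-value at or below `evalue_cutoff` are kept.
    """
    mapping = {}
    for line in blast_tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query_id, subject_id, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= evalue_cutoff:
            mapping.setdefault(query_id, set()).add(subject_id)
    return mapping


def drop_known_molecules(validation_set, reference_ligands):
    """Remove validation compounds that already occur in the reference library.

    `validation_set` maps a compound identifier to its list of targets;
    `reference_ligands` is the set of identifiers present in the library.
    """
    return {cid: targets for cid, targets in validation_set.items()
            if cid not in reference_ligands}


# Toy example: one strong hit (kept) and one weak hit (discarded).
hits = [
    "P1\tT1\t98.0\t300\t5\t0\t1\t300\t1\t300\t1e-120\t600",
    "P1\tT2\t40.0\t200\t100\t3\t1\t200\t5\t204\t1e-20\t90",
]
print(map_targets(hits))
print(drop_known_molecules({"c1": ["T1"], "c2": ["T2"]}, {"c1"}))
```

Removing the overlapping compounds before evaluation keeps the validation sets independent of the reference library, so a query is never scored against itself.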

Conclusions 

With the rapid advancement of high-throughput screening
technology, the sheer amount of bioassay data is so large
and is growing so fast that many traditional frameworks
encounter difficulties in launching a large campaign of
target fishing. More efficient approaches in the context of
'big data' are needed for this challenging task. In this
study, we exploited a simple scheme using 2D fingerprint
similarity ranking with a data fusion (DF) strategy to
predict drug-relevant targets based on a reference library
containing 533 targets with 179,807 active ligands. This
scheme performs well in predicting both therapeutic and
non-therapeutic targets for the approved drugs from
DrugBank and TTD. It also reproduces 62 of the 65 new
drug-target associations identified by SEA, and predicts
both on-target and off-target (hERG) interactions for the
nine drugs withdrawn due to hERG toxicity, with hERG ranked
within the top 20 for seven of them. Encouraged by these
results, we expect that the proposed scheme will enable
large-scale target fishing, which is useful both for
systematically identifying new uses of old drugs and for
exploring the molecular basis of their adverse events.

Figure 7 The flowchart of the ligand-based similarity-ranking scheme with data fusion. Step 1: prepare the reference ligand library (collect active ligands from BindingDB). Step 2: similarity searching with data fusion (calculate the ECFP_4 similarity scores between the query compound and the active ligands; for each target, combine the similarity scores of K references). Step 3: extract the target profile of the query (rank putative targets by fused scores).

Methods 

1. Similarity fusion for target fishing 

2D fingerprint is one of the most widely used forms to 
represent the chemical structure in molecular similarity 
searching. Among various fingerprint algorithms, the 
extended-connectivity fingerprint (ECFP) is noteworthy 
due to its efficiency and the ability to capture highly spe- 
cific atomic information [44]. In this study, the ECFP4 
fingerprint was calculated by a component ("Molecular 
Fingerprints") implemented in Pipeline Pilot 7.5 [45]. 
Given a query compound, its similarity score to a target,
which is represented by a set of reference ligands, is
obtained by fusing the pairwise fingerprint-based molecular
similarities. The similarity is measured by the Tanimoto
coefficient [46,47]. For a given target j with N_j reference
ligands, the following scores are calculated by different
similarity fusion schemes:

(I) KNN score (KS_j) is the average similarity of the K most
similar ligands of target j to the query;

(II) Max score (MS_j) is a special case of KNN with K equal
to 1, which considers only the ligand of target j that is
most similar to the query;

(III) Centroid score (CS_j) is the average similarity of all
N_j ligands of target j to the query.

Figure 7 outlines the target fishing workflow. The first
step is to prepare a well-curated reference library that
covers 533 targets, each represented by a set of active
ligands that is as comprehensive as possible. Then, for a
given query compound, 2D fingerprint-based similarity
searching runs through all the ligand sets, and the fusion
scores of each target are calculated. Altogether four types
of fusion scores were calculated, which are KS_j, MS_j and
CS_j. Finally, for each fusion score, all the 533 targets
were ranked in descending order, and the top ranked targets
were regarded as potential targets of the query. The
predictive performances of the different types of fusion
score were compared with a 10-fold cross-validation test,
in terms of the evaluation metrics defined in the next
section.

2. Evaluation metrics

The metrics we used are defined below:

\begin{align}
PR_n &= \frac{TP_n}{n} \tag{1} \\
RE_n &= \frac{TP_n}{m} \tag{2} \\
F_n &= \frac{2 \cdot PR_n \cdot RE_n}{PR_n + RE_n} \tag{3} \\
PR' &= \frac{1}{m} \sum_{i=1}^{m} PR_i \tag{4}
\end{align}



In this study, PR_n (eq. 1) is the fraction of positive
predictions that are "true" (experimentally verified
targets), where TP_n is the number of true positive
predictions among the top n ranked targets; RE_n (eq. 2) is
the fraction of the m "true" targets of a compound that are
recognized (predicted as positive). Both PR_n and RE_n thus
measure a model's ability to identify true targets. F_n
(eq. 3) is the harmonic mean of PR_n and RE_n, and a higher
F_n score means a better overall performance in
discriminating true targets.

PR' was introduced by Amini et al. [36]. For every correctly
predicted target that appears at the i-th position in the
top m ranked targets, where m is the number of true targets
of the ligand, the precision value at that position, PR_i,
is calculated. PR' is the average of the precision values
PR_i over ranking places 1 to m (eq. 4). According to this
definition, the relevant targets that do not appear in the
top m ranked targets receive a precision score of 0.
Finally, the averaged values of PR_n, RE_n, F_n and PR' over
all compounds of the validation datasets were reported.
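The fusion scores and evaluation metrics described in this Methods section can be sketched together as follows. This is an illustrative sketch, not the Pipeline Pilot protocol used in the study: fingerprints are represented as plain Python sets of feature identifiers (an assumption standing in for ECFP4 features), all function names are hypothetical, every target is assumed to have at least one reference ligand, and the PR' computation reflects one reading of eq. 4 in which positions without a correct prediction contribute 0.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of features."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def fusion_scores(query_fp, ligand_fps, k=3):
    """Return (KS_j, MS_j, CS_j) for the reference ligands of one target."""
    sims = sorted((tanimoto(query_fp, fp) for fp in ligand_fps), reverse=True)
    top_k = sims[:k]  # the K most similar reference ligands
    return sum(top_k) / len(top_k), sims[0], sum(sims) / len(sims)


def rank_targets(query_fp, library, k=3):
    """Rank targets by descending KNN score; library maps target -> fingerprints."""
    scores = {t: fusion_scores(query_fp, fps, k)[0] for t, fps in library.items()}
    return sorted(scores, key=scores.get, reverse=True)


def evaluate(ranking, true_targets, n):
    """Return PR_n, RE_n, F_n (eqs. 1-3) and PR' (eq. 4) for one query compound."""
    m = len(true_targets)
    tp_n = sum(t in true_targets for t in ranking[:n])
    pr_n, re_n = tp_n / n, tp_n / m
    f_n = 2 * pr_n * re_n / (pr_n + re_n) if tp_n else 0.0
    hits, pr_sum = 0, 0.0
    for i, target in enumerate(ranking[:m], start=1):
        if target in true_targets:
            hits += 1
            pr_sum += hits / i  # precision at position i of a correct hit
    return pr_n, re_n, f_n, pr_sum / m


# Toy library with three targets and a query identical to one ligand of T1.
library = {
    "T1": [{1, 2, 3}, {1, 2, 4}, {7, 8}],
    "T2": [{5, 6}, {6, 7}],
    "T3": [{1, 9}, {2, 9}],
}
query = {1, 2, 3}
ranking = rank_targets(query, library)
print(ranking)
print(evaluate(ranking, {"T1", "T2"}, n=2))
```

In the toy example the KNN scores with K = 3 are 0.5 for T1, 0.25 for T3 and 0 for T2, so the ranking is T1, T3, T2. With T1 and T2 as true targets and n = 2, all four metrics equal 0.5.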

Additional file 



Additional file 1: Table S1. The PR_n and RE_n values of 3NN and SEA for
the DrugBank and TTD validation sets, respectively. Table S2. The F_n score of
3NN and SEA for the DrugBank and TTD validation sets, respectively.